NLP (Natural Language Processing)¶
Introduction¶
- NLP - Text analysis, Sentiment analysis
- Natural Language Processing (NLP) is a branch of
artificial intelligence (AI)that focuses on enabling computers to understand, interpret, and generate human language in a way that is both meaningful and useful.
NLP can be broadly categorized into two types:¶
Natural Language Understanding (NLU):¶
it helps the machine to understand and analyse human language by extracting the metadata from content such as concepts, entities, keywords, emotion, relations, and semantic roles. NLU mainly used in Business applications to understand the customer's problem in both spoken and written language.
Natural Language Generation (NLG):¶
Natural Language Generation (NLG) acts as a translator that converts the computerized data into natural language representation. It mainly involves Text planning, Sentence planning, and Text Realization
How to compresse the text data¶
- Paragrph into sentance
- sentance into word
- word anlysis (stop words - meaningless word ) - Remove stop words
NLP Preprocessing steps:¶
- Text cleaning
- Lowercase
- Remove punctuation
- Tokenization
- sentance tokenization
- word tokenization
- stop word removing
- stemming & lemmetization
Cleaning¶
In [1]:
### Using sumbols
import string
puntuations = string.punctuation
In [2]:
puntuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [3]:
my_str = '(Hi, Hello!!!, How are you)'
# remove putuations
no_punt = ''
for i in my_str:
if i not in puntuations:
no_punt = no_punt + i
In [4]:
no_punt
Out[4]:
'Hi Hello How are you'
In [5]:
## Using regex
import re
In [6]:
re.sub(r'[^\w\s]', '', my_str)
Out[6]:
'Hi Hello How are you'
In [7]:
## Tokenization
import nltk # Natural Language Tool kit
nltk.download('punct')
nltk.download('wordnet')
[nltk_data] Error loading punct: Package 'punct' not found in index [nltk_data] Downloading package wordnet to [nltk_data] C:\Users\sathy\AppData\Roaming\nltk_data... [nltk_data] Package wordnet is already up-to-date!
Out[7]:
True
In [8]:
text = '''Backgammon is one of the oldest known board games.
Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.
It is a two player game where each player has fifteen checkers which move
between twenty-four points according to the roll of two dice.'''
In [9]:
## Sentance tokenization
sentance = nltk.sent_tokenize(text)
sentance
Out[9]:
['Backgammon is one of the oldest known board games.', 'Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.', 'It is a two player game where each player has fifteen checkers which move \nbetween twenty-four points according to the roll of two dice.']
In [10]:
## word tokenization
for w in sentance:
word = nltk.word_tokenize(w)
print(word)
['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.'] ['Its', 'history', 'can', 'be', 'traced', 'back', 'nearly', '5,000', 'years', 'to', 'archeological', 'discoveries', 'in', 'the', 'Middle', 'East', '.'] ['It', 'is', 'a', 'two', 'player', 'game', 'where', 'each', 'player', 'has', 'fifteen', 'checkers', 'which', 'move', 'between', 'twenty-four', 'points', 'according', 'to', 'the', 'roll', 'of', 'two', 'dice', '.']
Stop words¶
In [11]:
from nltk.corpus import stopwords
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to [nltk_data] C:\Users\sathy\AppData\Roaming\nltk_data... [nltk_data] Package stopwords is already up-to-date!
Out[11]:
True
In [12]:
stop_words = stopwords.words('english')
print(stop_words)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
In [13]:
### remove stop words
sen = 'Backgammon is one of the oldest known board games.'
word = nltk.word_tokenize(sen)
word
new_sen = [w for w in word if not w in stop_words]
new_sen
Out[13]:
['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']
Stemming & Lemmetization¶
- stemming - stem - chops the end of the word
- Ex : songs - song singing - sing
In [14]:
# stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
input_text = 'had been changing cities mice'
input_word = nltk.word_tokenize(input_text)
input_word
Out[14]:
['had', 'been', 'changing', 'cities', 'mice']
In [15]:
for i in input_word:
print(i, ':', stemmer.stem(i))
had : had been : been changing : chang cities : citi mice : mice
In [16]:
### Lemmatization
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
input_text = 'had been changing cities mice'
input_word = nltk.word_tokenize(input_text)
input_word
Out[16]:
['had', 'been', 'changing', 'cities', 'mice']
In [17]:
for i in input_word:
print(i, ':', lem.lemmatize(i))
had : had been : been changing : changing cities : city mice : mouse
POS Tagging¶
In [18]:
from nltk import word_tokenize
word = word_tokenize('The sky is blue')
word
Out[18]:
['The', 'sky', 'is', 'blue']
In [19]:
nltk.pos_tag(word)
Out[19]:
[('The', 'DT'), ('sky', 'NN'), ('is', 'VBZ'), ('blue', 'JJ')]
NLP Techniques¶
- spelling correction
- Bag of Words
- TF - IDF (Term Frequency - Inverse Document Frequency)
Spelling correction¶
In [22]:
# !pip install autocorrect
In [24]:
from autocorrect import Speller
spell = Speller()
In [25]:
spell('amberlla')
Out[25]:
'umbrella'
In [26]:
spell('ur')
Out[26]:
'ur'
Bag of Words¶
- Bag of words model helps convert the text into numerical representation (numerical feature vectors) such that the same can be used to train models using machine learning algorithms.
- Here are the key steps of fitting a bag-of-words model:
- Create a vocabulary indices of words or tokens from the entire set of documents. The vocabulary indices can be created in alphabetical order.
- Construct the numerical feature vector for each document that represents how frequent each word appears in different documents. The feature vector representing each will be sparse in nature as the words in each document will represent only a small subset of words out of all words (bag-of-words) present in entire set of documents.
In [27]:
doc = ['Good movie', 'Worst movie', 'Good movie and Good screenplay']
doc
Out[27]:
['Good movie', 'Worst movie', 'Good movie and Good screenplay']
In [34]:
from sklearn.feature_extraction.text import CountVectorizer
# design a vocabulary
count_vect = CountVectorizer()
count_vect
# create bag of words model
bag_of_words = count_vect.fit_transform(doc)
feature_name = count_vect.get_feature_names_out()
print(feature_name)
print(bag_of_words.toarray())
['and' 'good' 'movie' 'screenplay' 'worst'] [[0 1 1 0 0] [0 0 1 0 1] [1 2 1 1 0]]
In [35]:
import pandas as pd
pd.DataFrame(bag_of_words.toarray(), columns=feature_name)
Out[35]:
| and | good | movie | screenplay | worst | |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 |
| 2 | 1 | 2 | 1 | 1 | 0 |
ngram¶
In [38]:
# design a vocabulary
count_vect = CountVectorizer(ngram_range=(1,2))
count_vect
# create bag of words model
bag_of_words = count_vect.fit_transform(doc)
feature_name = count_vect.get_feature_names_out()
print(feature_name)
print(bag_of_words.toarray())
['and' 'and good' 'good' 'good movie' 'good screenplay' 'movie' 'movie and' 'screenplay' 'worst' 'worst movie'] [[0 0 1 1 0 1 0 0 0 0] [0 0 0 0 0 1 0 0 1 1] [1 1 2 1 1 1 1 1 0 0]]
TF-IDF¶
Term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
In [39]:
doc = ['Good movie', 'Worst movie', 'Good movie and Good screenplay']
In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
x = vect.fit_transform(doc)
name = vect.get_feature_names_out()
pd.DataFrame(x.toarray(), columns=name)
Out[43]:
| and | good | movie | screenplay | worst | |
|---|---|---|---|---|---|
| 0 | 0.000000 | 0.789807 | 0.613356 | 0.000000 | 0.000000 |
| 1 | 0.000000 | 0.000000 | 0.508542 | 0.000000 | 0.861037 |
| 2 | 0.463121 | 0.704430 | 0.273526 | 0.463121 | 0.000000 |
Spam mail ham mail classification¶
In [44]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import wordcloud
In [46]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
stop_word = stopwords.words('english')
In [48]:
# read the dataset
df = pd.read_csv('email spams.csv', encoding='latin')
df
Out[48]:
| title | Message | |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 1334 | ham | Oh... Icic... K lor, den meet other day... |
| 1335 | ham | Oh ! A half hour is much longer in Syria than ... |
| 1336 | ham | Sometimes we put walls around our hearts,not j... |
| 1337 | ham | Sweet, we may or may not go to 4U to meet carl... |
| 1338 | ham | Then she buying today? Ü no need to c meh... |
1339 rows × 2 columns
In [49]:
# apply label encoder
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
df['title'] = LE.fit_transform(df['title'])
In [51]:
df.head()
Out[51]:
| title | Message | |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
In [52]:
# check info
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1339 entries, 0 to 1338 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 title 1339 non-null int32 1 Message 1339 non-null object dtypes: int32(1), object(1) memory usage: 15.8+ KB
In [53]:
# check duplictaes
df.duplicated().sum()
Out[53]:
42
In [54]:
# drop duplicates
df.drop_duplicates(inplace=True)
In [55]:
df.shape
Out[55]:
(1297, 2)
EDA¶
In [57]:
df.title.value_counts(normalize=True)
Out[57]:
title 0 0.855821 1 0.144179 Name: proportion, dtype: float64
In [62]:
plt.pie(df['title'].value_counts(), labels=['ham', 'spam'], autopct='%0.2f')
Out[62]:
([<matplotlib.patches.Wedge at 0x15934125d90>, <matplotlib.patches.Wedge at 0x15934135610>], [Text(-0.9890754300356166, 0.4813832087847064, 'ham'), Text(0.9890754300356165, -0.4813832087847066, 'spam')], [Text(-0.5394956891103363, 0.26257265933711255, '85.58'), Text(0.5394956891103362, -0.26257265933711266, '14.42')])
In [63]:
sns.countplot(x='title', data=df)
Out[63]:
<Axes: xlabel='title', ylabel='count'>
Preprocessing¶
In [66]:
# word tokenization
def no_of_words(text):
word = text.split()
word_count = len(word)
return word_count
In [67]:
df['no of words'] = df['Message'].apply(no_of_words)
df.head()
Out[67]:
| title | Message | no of words | |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | 20 |
| 1 | 0 | Ok lar... Joking wif u oni... | 6 |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | 28 |
| 3 | 0 | U dun say so early hor... U c already then say... | 11 |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | 13 |
In [69]:
fig, ax = plt.subplots(1,2, figsize=(10,5))
ax[0].hist(df[df['title']==1]['Message'].str.len(), label='spam', color = 'red')
ax[0].legend(loc='upper left')
ax[1].hist(df[df['title']==0]['Message'].str.len(), label='ham', color = 'green')
ax[1].legend(loc='upper right')
Out[69]:
<matplotlib.legend.Legend at 0x15938f27690>
In [74]:
# create text processimg function
def data_processing(text):
text = text.lower()
text = re.sub(r'[^\w\s]', '', text)
text = re.sub(r'<br />', '', text)
# tokenization
text_token = word_tokenize(text)
# stop words removal
filtered_text = [w for w in text_token if not w in stop_word]
return " ".join(filtered_text)
In [75]:
df['new message'] = df['Message'].apply(data_processing)
df.head()
Out[75]:
| title | Message | no of words | new message | |
|---|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | 20 | go jurong point crazy available bugis n great ... |
| 1 | 0 | Ok lar... Joking wif u oni... | 6 | ok lar joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | 28 | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | 0 | U dun say so early hor... U c already then say... | 11 | u dun say early hor u c already say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | 13 | nah dont think goes usf lives around though |
In [76]:
# create function for lemmatization
lem = WordNetLemmatizer()
def lemmetizing(text):
data = [lem.lemmatize(word) for word in text]
return text
In [77]:
df['lematizing_data'] = df['new message'].apply(lemmetizing)
df.head()
Out[77]:
| title | Message | no of words | new message | lematizing_data | |
|---|---|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | 20 | go jurong point crazy available bugis n great ... | go jurong point crazy available bugis n great ... |
| 1 | 0 | Ok lar... Joking wif u oni... | 6 | ok lar joking wif u oni | ok lar joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | 28 | free entry 2 wkly comp win fa cup final tkts 2... | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | 0 | U dun say so early hor... U c already then say... | 11 | u dun say early hor u c already say | u dun say early hor u c already say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | 13 | nah dont think goes usf lives around though | nah dont think goes usf lives around though |
In [78]:
df['word_cunt(lem)'] = df['lematizing_data'].apply(no_of_words)
df.head()
Out[78]:
| title | Message | no of words | new message | lematizing_data | word_cunt(lem) | |
|---|---|---|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | 20 | go jurong point crazy available bugis n great ... | go jurong point crazy available bugis n great ... | 16 |
| 1 | 0 | Ok lar... Joking wif u oni... | 6 | ok lar joking wif u oni | ok lar joking wif u oni | 6 |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | 28 | free entry 2 wkly comp win fa cup final tkts 2... | free entry 2 wkly comp win fa cup final tkts 2... | 23 |
| 3 | 0 | U dun say so early hor... U c already then say... | 11 | u dun say early hor u c already say | u dun say early hor u c already say | 9 |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | 13 | nah dont think goes usf lives around though | nah dont think goes usf lives around though | 8 |
In [80]:
# ham mail
ham_mail = df[df['title'] == 0]
ham_mail.head()
Out[80]:
| title | Message | no of words | new message | lematizing_data | word_cunt(lem) | |
|---|---|---|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | 20 | go jurong point crazy available bugis n great ... | go jurong point crazy available bugis n great ... | 16 |
| 1 | 0 | Ok lar... Joking wif u oni... | 6 | ok lar joking wif u oni | ok lar joking wif u oni | 6 |
| 3 | 0 | U dun say so early hor... U c already then say... | 11 | u dun say early hor u c already say | u dun say early hor u c already say | 9 |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | 13 | nah dont think goes usf lives around though | nah dont think goes usf lives around though | 8 |
| 6 | 0 | Even my brother is not like to speak with me. ... | 16 | even brother like speak treat like aids patent | even brother like speak treat like aids patent | 8 |
In [82]:
text = ' '.join([word for word in ham_mail['lematizing_data']])
from wordcloud import WordCloud
plt.figure(figsize=(20,15))
wordcloud = WordCloud(max_words=200).generate(text)
plt.imshow(wordcloud)
Out[82]:
<matplotlib.image.AxesImage at 0x1593613a810>
In [83]:
from collections import Counter
count = Counter()
for text in ham_mail['lematizing_data'].values:
for word in text.split():
count[word] += 1
count.most_common(15)
Out[83]:
[('u', 204),
('im', 102),
('get', 72),
('ok', 67),
('like', 63),
('dont', 61),
('2', 60),
('got', 59),
('time', 56),
('know', 55),
('ill', 53),
('call', 53),
('go', 52),
('ltgt', 50),
('come', 50)]
In [84]:
ham_word = pd.DataFrame(count.most_common(15))
ham_word.columns = ['word', 'Count']
ham_word
Out[84]:
| word | Count | |
|---|---|---|
| 0 | u | 204 |
| 1 | im | 102 |
| 2 | get | 72 |
| 3 | ok | 67 |
| 4 | like | 63 |
| 5 | dont | 61 |
| 6 | 2 | 60 |
| 7 | got | 59 |
| 8 | time | 56 |
| 9 | know | 55 |
| 10 | ill | 53 |
| 11 | call | 53 |
| 12 | go | 52 |
| 13 | ltgt | 50 |
| 14 | come | 50 |
In [85]:
import plotly.express as px
px.bar(ham_word, x='Count', y='word', title='Common words in ham_mail', color ='word')
In [86]:
# task = do for spam mail
In [87]:
x = df['lematizing_data']
y = df['title']
In [88]:
vect = TfidfVectorizer()
x = vect.fit_transform(df['lematizing_data'])
In [89]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
In [90]:
x.shape
Out[90]:
(1297, 3968)
In [91]:
from sklearn.linear_model import LogisticRegression
In [92]:
lr = LogisticRegression()
lr.fit(x_train, y_train)
Out[92]:
LogisticRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression()
In [93]:
lr_pred = lr.predict(x_test)
lr_pred
Out[93]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
In [94]:
from sklearn.metrics import accuracy_score
In [95]:
accuracy_score(y_test, lr_pred)
Out[95]:
0.9115384615384615
In [96]:
# apply all other alogithms
In [ ]: